EXPRESS MAIL LABEL 
NO. B95952030 


29T203/701 
09114 
November 23, 1986 


PATENT APPLICATION
OF 


GUY L. STEELE, JR.
W. DANIEL HILLIS 
GUY BLELLOCH 
MICHAEL DRUMHELLER 
BREWSTER KAHLE 
CLIFFORD LASSER 
ABHIRAM RANADE 
JAMES SALEM 
KARL SIMS 


FOR 
VIRTUAL PROCESSOR TECHNIQUES 
IN A 


MULTIPROCESSOR ARRAY 






VIRTUAL PROCESSOR TECHNIQUES IN A MULTIPROCESSOR ARRAY 


Cross References to Related Applications 


Related applications are "Parallel Processor", Serial No.
499,474 and "Parallel Processor/Memory Circuit", Serial No.
499,471, both filed May 31, 1983; "Method and Apparatus for
Routing Message Packets," Serial No. 671,835, filed November 15,
1984 and now U.S. Patent No. 4,598,400, issued July 1, 1986;
"Method and Apparatus for Interconnecting Processors in a
Hyper-Dimensional Array," Serial No. 740,943, filed May 31,
1985; "Method of Simulating Additional Processors in a SIMD
Parallel Processor Array," Serial No. 832,913, filed February
24, 1986; and "Pipelining Technique and Pipelined Processes for
Multi-Dimensional Processor Arrays," filed on even date herewith
by G. Blelloch et al; all of which are hereby incorporated by
reference.


Field of the Invention 


This invention relates to the field of parallel processing 
and multi-processing in digital computer systems. More
particularly, it relates to a method of simulating and utilizing
additional (i.e., virtual) processors in a single-instruction 
multiple-data (SIMD) parallel processor array. 


Background of the Invention 


In a SIMD computer, such as the Connection Machine (Reg. 
T.M. of Thinking Machines Corporation, Cambridge, MA) computer 
system, the architecture is designed to support a data parallel 
style of programming. In this style one programs assuming a 
separate processor for every data element, so that one may 
effectively operate on all data elements in parallel. 


The Connection Machine computer system supports such a style 
of programming by providing tens of thousands of individual 
hardware data processors, each with its own memory for holding a 
data element. (Current standard Connection Machine system
configurations provide 16,384 processors and 65,536 
processors.) The data processors all process instructions 
issued on a centrally controlled instruction bus, so that at any 
given time all processors (or all processors in a large group) 
are executing the same instruction. The instruction bus is 
driven by a front end computer, which is a conventional 
single-processor computer such as a Symbolics 3600 computer or a
Digital Equipment Corporation VAX Computer. 


For example, if an ADD instruction is issued, then all
processors perform addition, each on its own data. (Most
instructions are conditional, so that a flag bit in each
processor becomes an additional implicit input to the operation,
and the operation's results are stored only in processors whose
flag bit is 1.) Many of the usual arithmetic and logic
instructions found in contemporary computer instruction sets
(such as SUBTRACT, MULTIPLY, DIVIDE, MAX, MIN, COMPARE, LOGICAL 
AND, LOGICAL OR, LOGICAL EXCLUSIVE OR, and floating-point 
instructions) are provided in this form; when one such 
instruction is issued, it is performed (possibly conditionally) 
by every hardware processor, each on its own data. 


Other computer systems of this general style have also been 
built. Prominent among these are the ICL DAP and the Goodyear 
MPP. A typical difficulty with these computer systems is that 
programming becomes much more complicated if the number of data 
elements in the problem to be solved exceeds the number of 
hardware processors. The Goodyear MPP, for example, provides 
16,384 hardware processors configured in a 128 x 128
two-dimensional grid. If a problem requires the processing of
200 x 200 elements (total 40,000), the programming task is much
more difficult because one can no longer assign one data element
to each processor, but must assign two or more data elements to
some processors. Even if a problem requires no more than 16,384
data elements, if they are to be organized as a 64 x 256 grid
rather than a 128 x 128 pattern, programming is again
complicated, this time because the problem communication
structure does not match the hardware communication structure.


Summary of the Invention


The instruction set of the Connection Machine computer 
system alleviates these difficulties by supporting a virtual 
processor mechanism at the lowest level of implementation. The 
virtual processor mechanism causes every physical hardware 
processor to be used to simulate multiple "virtual" processors. 
Each physical processor simulates the same number of virtual 
processors; the number of virtual processors simulated by each
physical processor is called the VP-ratio. The VP-ratio is
software selectable and is determined when the Connection
Machine is initialized (i.e., "cold-booted") before running a
user application.


Let n stand for the VP-ratio at a given point in time. The 
virtual processor mechanism causes the memory of each physical 
processor to be divided into n regions of equal size; each such 
region is the memory for one virtual processor. The set of
virtual processors whose memory is stored at the same relative
position within each physical memory is called a VP-bank. If
the VP-ratio is n, then there are n VP-banks. A VP-bank is a
set of virtual processors that can be serviced simultaneously by
the physical processors.


Whenever an instruction is processed, each physical
processor is time-sliced among the virtual memory regions,
performing the operation first as one virtual processor, then
another, until the operation has been performed for all virtual
processors. Only then is the next instruction accepted from the
instruction bus.


Superficially this is similar to a multiprocessor system in
which each physical processor is time-shared among several
virtual processors. However, the virtual processor mechanism
described herein differs from conventional time-sharing
techniques (multiprocessor or otherwise) in two important
respects. First, in a conventional time-sharing system, the
switching of a processor among virtual processes typically
occurs at unpredictable times dictated by asynchronously
generated events such as interrupts from a real-time clock. By
contrast, the present virtual processor mechanism switches each
physical processor among the virtual processors in a completely
regular, predictable, deterministic fashion. Second, in a
conventional time-sharing system the switching of a processor
among virtual processes typically occurs between instructions;
the physical processor executes some instructions on behalf of
one virtual process, and then, after a switch, executes some
instructions on behalf of another virtual process. The present
virtual processor mechanism, by contrast, switches among virtual
processors within instructions; at the completion of each
instruction, that instruction has been executed on behalf of all
virtual processors.


The model of simply time-slicing physical processors among
virtual processors in a round-robin fashion is a simple one, and
suffices to explain instructions for which no interprocessor
communication occurs. Interesting complications occur when
communication is involved. In the description below, we examine
in greater detail the processing of five specific instructions
in the Connection Machine System instruction set that are
illustrative of the handling of virtual processors: ADD,
GLOBAL-ADD, PLUS-SCAN, GET-FROM-EAST, and SEND. In each case it
is assumed that the operation has already been implemented for
physical processors; we then exhibit an implementation of the
operation for virtual processors. The detailed description
should be read in conjunction with the several sheets of
accompanying drawing, which is incorporated by reference herein.


Brief Description of the Drawing 


In the drawing, 


Fig. 1 is an illustration of the general behavior of a
single ADD instruction without the virtual processor mechanism
of the present invention;


Fig. 2 is an illustration of the general behavior of a
single ADD instruction utilizing the virtual processor mechanism
of the present invention;


Fig. 3 is an illustration of the operation of the ADD 
instruction of Fig. 2 for eight hardware processors operated as 
four virtual processor banks in accordance with the present 
invention, showing the status of the virtual processor banks 
before the ADD instruction adds B into A; 


Fig. 4 is a further illustration of the operation of the ADD
instruction showing the states of the VP banks after VP-bank 0
has been processed;


Fig. 5 is a further illustration of the operation of the ADD
instruction showing the states of the VP banks after VP-bank 1
has been processed;


Fig. 6 is a further illustration of the operation of the ADD
instruction showing the states of the VP banks after VP-bank 2
has been processed;


Fig. 7 is a further illustration of the operation of the ADD
instruction showing the states of the VP banks after VP-bank 3
has been processed;


Fig. 8 is an illustration of the method steps for executing
the ADD instruction for virtual processors, in accordance with 
the present invention, optimized to take advantage of the 
availability of a global-OR bit; 


Fig. 9 is an illustration of the method steps for executing 
the global-ADD instruction for virtual processors, in accordance 
with the present invention; 


Fig. 10 is an illustration of the steps for executing the
global-ADD instruction for virtual processors in accordance with 
the present invention, optimized to take advantage of the 
presence of a global-OR bit; 


Figs. 11-17 illustrate, in succession, the contents of the
virtual processors initially and at each successive step during
the execution of the global-ADD instruction, in accordance with
the present invention;


Fig. 18 illustrates the method steps for executing the
PLUS-SCAN instruction for virtual processors in accordance with
the present invention;


Figs. 19-29 diagrammatically illustrate, in succession, the
contents of each virtual processor before a PLUS-SCAN
instruction from field b to field a, and at each successive step
in the execution of that instruction;


Fig. 30 illustrates the method steps for executing the 
GET-FROM-EAST instruction for virtual processors, in accordance 
with the present invention; 


Fig. 31 illustrates a variation on the method of Fig. 30 
wherein fields a and b are to be the same field and only one
scratch area is used within each physical processor;


Figs. 32-36 illustrate, in succession, the contents of each
virtual processor, both before a GET-FROM-EAST instruction from 
field b to field a and at each successive step during the 
execution of that instruction; 


Fig. 37 diagrammatically illustrates the two-dimensional 
grid into which the eight physical processors P0-P7 are
organized for the operation of Figs. 32-36;


Fig. 38 diagrammatically illustrates the grid into which the
virtual processors in banks 0-3 are organized for use, in
cooperation with the physical grid of Fig. 37, for execution of
the GET-FROM-EAST instruction in accordance with Figs. 32-36;


Fig. 39 diagrammatically illustrates the overall virtual 
NEWS grid represented by the combination of Figs. 37 and 38;


Fig. 40 is a diagrammatic illustration of the message format
supplied to the virtual router in accordance with the present
invention, for execution of the SEND instruction;


Fig. 41 illustrates the method steps for execution of the
SEND instruction for virtual processors, in accordance with the
present invention; and


Fig. 42 illustrates the method steps for execution of an
improved SEND instruction for virtual processors, in accordance
with the present invention.


Detailed Description of Illustrative Embodiments 


Example: the ADD operation 

The ADD operation causes every virtual processor to add one 
memory field to another, causing the second one to be altered to 
contain the sum. Carry and overflow from the operation are 
recorded also. There are flag bits associated with each virtual 
processor. Call the total number of flag bits f; in the current
Connection Machine computer implementation, f = 4. Two of these
flag bits (i.e., carry and overflow) function as condition code
bits of the usual sort, recording results of operations; 
another, called context, is the bit that controls whether 
conditional operations will store their results. 


If the VP-ratio is n, then the memory of a physical 
processor is partitioned into n regions of identical size, but 
there is more to the story than that. First of all, a certain 
amount r of each physical processor's memory is set aside for 
per-physical-processor housekeeping purposes. Second, four bits 
of memory must be set aside to hold the flag bits of each 
virtual processor. If m is the total amount of memory per
physical processor (4,096 bits in the current implementation),
then the amount of memory, v, set aside for each virtual
processor is


     v = (m - r) / n


of which f bits serve as the simulated flag bits of the virtual
processor and v - f bits serve as the simulated memory. The v
bits of per-processor physical memory set aside for each virtual
processor may (but need not) be contiguous. If these v bits
are contiguous, the first v - f bits may be used for the
simulated memory, and bits v - f through v - 1 of each block may
contain the simulated flags. In this arrangement, memory bit j
of virtual processor k is stored at physical address kv + j.
Other arrangements are possible, including one where the bits of
all virtual processors are interleaved and therefore memory bit
j of virtual processor k is stored at physical address jn + k.
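

The address arithmetic of the two arrangements just described
may be sketched as follows. This is an illustrative sketch only,
taking v = (m - r) / n as the definitions of m, r and n imply.
The function names are hypothetical, and the housekeeping size
r = 64 used in the usage note is an assumed value, not one given
in this description.

```python
def region_size(m, r, n):
    """Bits set aside per virtual processor: v = (m - r) / n."""
    if (m - r) % n != 0:
        raise ValueError("m - r must be divisible by the VP-ratio n")
    return (m - r) // n


def contiguous_address(j, k, v):
    """Physical address of memory bit j of virtual processor k
    when each virtual processor owns one contiguous block of v bits."""
    return k * v + j


def interleaved_address(j, k, n):
    """Physical address of memory bit j of virtual processor k
    when the bits of the n virtual processors are interleaved."""
    return j * n + k


def flag_addresses(k, v, f=4):
    """Addresses of the f simulated flag bits of virtual processor k
    in the contiguous layout (bits v - f through v - 1 of block k)."""
    return [k * v + i for i in range(v - f, v)]
```

With m = 4096, an assumed r = 64 and n = 4, region_size gives
v = 1008; bit 5 of virtual processor 2 then sits at address 2021
in the contiguous layout and at address 22 in the interleaved one.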


Without the virtual processor mechanism, the general
behavior of a single ADD instruction issued by the front end
computer with operand addresses a and b would be as shown in
Fig. 1.


With the virtual processor mechanism, the general behavior
of that same single ADD instruction issued by the front end
computer is shown in Fig. 2. Note that the test flag is not
loaded or stored as it is not used by the ADD instruction, and
the context flag is not stored back because it is not changed by
the ADD instruction. Other instructions, of course, load and
store different flags according to need.
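

The bank-by-bank behavior described above may be sketched in
software as follows. This is a simplified model, not a
transcription of Fig. 2: fields are held as nested lists indexed
[VP-bank][physical processor], the inner loop stands in for what
the hardware does on all physical processors at once, and carry
and overflow are omitted. The name virtual_add is hypothetical.

```python
def virtual_add(a, b, context):
    """Conditionally add field b into field a for every virtual processor.

    a, b and context are lists indexed [bank][physical processor].
    Results are stored only where the context flag is 1.
    """
    for k in range(len(a)):            # one pass per VP-bank
        for p in range(len(a[k])):     # simultaneous in real hardware
            if context[k][p]:          # context flag loaded for bank k
                a[k][p] += b[k][p]
    return a
```

For two banks on two physical processors, with a = [[1, 2],
[3, 4]], b = [[10, 20], [30, 40]] and context = [[1, 0], [0, 1]],
the call leaves a = [[11, 2], [3, 44]].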


Figs. 3-7 illustrate the operation of the ADD instruction
for eight hardware processors and a VP-ratio of four (i.e.,
there are four VP-banks providing 4 x 8 = 32 virtual
processors). In these figures, the context flag is represented
as a bullet (•) for the value 1 or as a circle (○) for the value
0. Remember that results are stored only in virtual processors
whose context flag is 1. The carry and overflow flags are not
represented in the figures. In each of Figs. 4-7, an arrow (←)
indicates the places where that figure differs from the
preceding one. For uniformity with later figures, the
per-physical-processor housekeeping area is shown even though it
is not used for the ADD instruction.


A refinement of this technique allows conditional
instructions of this sort to be executed much more efficiently
in some cases. There is a hardware mechanism in the Connection
Machine system that can compute the logical OR of one bit from 
every hardware processor. This result is called the global-OR 
bit. For each VP-bank, the global-OR of the context flags for 
all virtual processors in that bank can be checked quickly; if 
the global-OR bit is zero, then no virtual processor in that 
bank will store results, and the labor of computing the sums can 
be avoided, as shown in Fig. 8. For some user applications this 
technique can yield a significant speed improvement. 
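

The bank-skipping refinement may be sketched as follows. This
is an illustrative model only: the global-OR hardware is stood
in for by Python's any(), fields are nested lists indexed
[VP-bank][physical processor], and the function name and the
returned list of skipped banks are hypothetical conveniences.

```python
def virtual_add_skipping(a, b, context):
    """Conditional ADD that skips VP-banks whose context flags are all 0.

    Returns the list of bank numbers that were skipped.
    """
    skipped = []
    for k in range(len(a)):
        # Global-OR of the context flags of bank k across all processors.
        if not any(context[k]):
            skipped.append(k)          # no processor would store a result
            continue
        for p in range(len(a[k])):
            if context[k][p]:
                a[k][p] += b[k][p]
    return skipped
```

The saving depends entirely on how often whole banks are
inactive; when every bank has at least one active virtual
processor, the extra global-OR test is pure overhead.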


Example: GLOBAL-ADD 

Without virtual processors, the GLOBAL-ADD instruction causes
one integer from each hardware processor whose context flag is
set to be contributed to an accumulating sum; the total sum of
all such integers is reported to the front end computer. The
memory and flag bits of the individual processors are not
changed (except for scratch space in a reserved housekeeping
area of memory not visible to the user, that is, to the front
end computer).


For purposes of exposition let us call the foregoing
operation PHYSICAL-GLOBAL-ADD. The effect of the GLOBAL-ADD
instruction on the field at virtual address b under the virtual
processor mechanism may then be described as shown in Fig. 9.


The same VP-bank skipping technique described above for the
ADD instruction may be used to potentially speed up GLOBAL-ADD
as well (see Fig. 10). (For the remainder of this description
we will not include or mention this technique further; it is
easily incorporated into other operations where appropriate.)


Figs. 11-17 illustrate the operation of the conditional
GLOBAL-ADD instruction for eight hardware processors operated as
four VP-banks.
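

A two-level reduction of the kind described for GLOBAL-ADD may
be sketched as follows. This is a simplified model rather than
a transcription of Fig. 9: the list r plays the part of a scratch
field in the per-physical-processor housekeeping area, and it is
an assumption of this sketch that the final physical reduction is
run with every physical processor contributing. All function
names are hypothetical.

```python
def physical_global_add(values, context):
    """Sum one integer from each physical processor whose flag is set."""
    return sum(v for v, c in zip(values, context) if c)


def virtual_global_add(b, context):
    """Sum field b over all virtual processors whose context flag is 1.

    b and context are indexed [bank][physical processor].
    """
    num_procs = len(b[0])
    r = [0] * num_procs                # housekeeping scratch field
    for k in range(len(b)):            # accumulate within each processor
        for p in range(num_procs):
            if context[k][p]:
                r[p] += b[k][p]
    # One physical reduction over the per-processor partial sums.
    return physical_global_add(r, [1] * num_procs)
```

Only one physical reduction is needed regardless of the VP-ratio;
the per-bank work is purely local to each physical processor.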


Example: PLUS-SCAN

For the discussion of ADD and GLOBAL-ADD it was not necessary
to distinguish one virtual processor from another, as they all 
participate equally and indistinguishably in those operations. 


Other operations do distinguish among virtual processors, though. 


Every physical processor has a distinct processor number, or
address. In a 65,536-processor Connection Machine system, the
physical processors are numbered from 0 to 65,535. (Note that
65,536 = 2^16, so a processor number may be represented in 16
bits, for example.)


The virtual processor mechanism also assigns a distinct
number to each virtual processor. The VP-banks are numbered
from 0 to n - 1. The virtual processor in physical processor j
and in VP-bank k then has virtual processor number jn + k. (In
the current implementation, the VP-ratio is restricted to be an
integral power of 2. This is primarily for implementation
convenience, so that the physical processor number is simply one
bit field of the virtual processor number that can be extracted
without performing a division operation.)
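

The numbering scheme, and the way a power-of-two VP-ratio lets
the two parts be separated with a shift and a mask, may be
sketched as follows. The function names are hypothetical.

```python
def vp_number(j, k, n):
    """Virtual processor number of VP-bank k in physical processor j."""
    return j * n + k


def split_vp_number(vp, n):
    """Recover (physical processor number, VP-bank) without division.

    Requires the VP-ratio n to be an integral power of 2.
    """
    if n <= 0 or n & (n - 1):
        raise ValueError("VP-ratio must be a power of 2")
    bank_bits = n.bit_length() - 1     # log2(n)
    return vp >> bank_bits, vp & (n - 1)
```

For example, with n = 4, physical processor 5 and VP-bank 3 give
virtual processor number 23, and split_vp_number(23, 4) returns
(5, 3).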


Without virtual processors, the unconditional PLUS-SCAN
instruction with operand addresses a and b causes field a in
hardware processor j to receive the integer sum of fields b from
every hardware processor whose processor number k is strictly
less than j. Call this PHYSICAL-PLUS-SCAN-ALWAYS. Its
conditional analogue, PHYSICAL-PLUS-SCAN, causes field a in
hardware processor j to receive (provided that the context flag
of processor j is 1) the integer sum of fields b from every
hardware processor whose processor number k is strictly less
than j and whose context flag is 1. The flag bits of the
individual processors are not changed.


The behavior of the conditional PLUS-SCAN operation on
virtual processors is illustrated in Fig. 18. (The
unconditional version is obtained simply by changing any and all
conditional subsidiary operations to be unconditional; all
loading of the hardware context flag may then be eliminated.)
The first part (steps (A) and (B)) resembles the first part of
the GLOBAL-ADD algorithm of Fig. 9. In step (C), an
unconditional plus-scan on physical processors is performed.
Step (D) spreads results back to the virtual processors within
each physical processor.


Figs. 19-29 illustrate the operation of the PLUS-SCAN 
instruction for eight hardware processors operated as four 
VP-banks. In this example, the context flag of every processor
is 1, so the operation is effectively unconditional. Observe
that at the end of the operation, the field r in the
highest-numbered physical processor contains the total sum of
fields b from all virtual processors.


Example: GET-FROM-EAST 

The hardware NEWS grid of the Connection Machine organizes
physical processors in a regular manner into a two-dimensional
grid wherein each processor communicates with its north, east,
west and south neighbors. A given physical processor's x and y
coordinates within the grid can be calculated from its processor
number. We will write these functions of a processor number j
as X(j) and Y(j), respectively. If the total number of
processors is P, and P_x and P_y are the dimensions of the
hardware grid, then P_x P_y = P. Given the x and y coordinates
of a physical processor, we can calculate its processor number
P(x,y). Therefore we have P(X(j),Y(j)) = j, X(P(x,y)) = x,
and Y(P(x,y)) = y.


The hardware grid is so organized that every physical
processor has four neighbors. The neighbors of processor j are
the processors with numbers P((X(j)-1) mod P_x, Y(j)),
P((X(j)+1) mod P_x, Y(j)), P(X(j), (Y(j)-1) mod P_y) and
P(X(j), (Y(j)+1) mod P_y). These are respectively called the
neighbors to the West, East, North and South. There are
instructions, one for each of the four directions, that cause
every processor to receive data from its neighbor in that
direction; the processor then stores that data into memory.
Each instruction comes in conditional and unconditional
varieties; the conditional form causes each processor to store
the incoming data only if its context flag is 1.


The virtual processor mechanism organizes the virtual processors
within a physical processor into a small grid. If the VP-ratio
is n, then the small grid may be of shape n_x by n_y, where
n_x and n_y are any two integers such that n_x n_y = n. The
virtual processor in VP-bank k has coordinates X'(k) and
Y'(k) within this small grid.


The small grids within the physical processors are conjoined
like the patchwork squares in a quilt to make one large grid of
size P_x n_x by P_y n_y. The virtual processor in VP-bank k
of physical processor j has, as already stated above, virtual
processor number jn + k. That same virtual processor then has
virtual NEWS coordinates X(j) n_x + X'(k) and Y(j) n_y +
Y'(k). The virtual processor with virtual NEWS coordinates
x and y has virtual processor number P(x div n_x, y div n_y) n
+ k, where k is the VP-bank for which X'(k) = x mod n_x and
Y'(k) = y mod n_y.
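

The conversion between virtual processor numbers and virtual
NEWS coordinates may be sketched as follows. The description
does not fix how processor numbers map to grid positions, so this
sketch assumes a row-major layout for both the physical grid
(x = j mod P_x, y = j div P_x) and the small grid within each
processor; that layout and the function names are assumptions
made for illustration only.

```python
def vp_to_news(vp, px, nx, ny):
    """Virtual NEWS coordinates (x, y) of a virtual processor number.

    px is the physical grid width; nx by ny is the small-grid shape.
    Row-major numbering is assumed for both grids.
    """
    j, k = divmod(vp, nx * ny)         # physical processor, VP-bank
    x = (j % px) * nx + (k % nx)
    y = (j // px) * ny + (k // nx)
    return x, y


def news_to_vp(x, y, px, nx, ny):
    """Inverse mapping: virtual processor number at coordinates (x, y)."""
    j = (y // ny) * px + (x // nx)     # which physical processor
    k = (y % ny) * nx + (x % nx)       # which VP-bank inside it
    return j * nx * ny + k
```

For a 4 x 2 physical grid with 2 x 2 small grids (eight
processors, VP-ratio four), virtual processor 7 lies at (3, 1),
and the two functions are inverses over all 32 virtual processors.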


When data is received from a virtual neighbor to the East,
for example, virtual processors that are on the East edge of
their small grid will need to receive data that is in another
physical processor; all other virtual processors receive data
from virtual processors that are within the same physical
processor.


Fig. 30 depicts the general behavior of the instruction
GET-FROM-EAST, with the virtual processor mechanism, sending out
data from field b and receiving it into the distinct field a.
(The unconditional version is obtained simply by changing any
and all conditional subsidiary operations to be unconditional;
all loading of the hardware context flag may then be eliminated.)


If the virtual fields a and b are to be the same field, an
extra scratch field may be used within each virtual processor.
Or a single scratch area may be used within each physical
processor, thereby conserving memory, as shown in Fig. 31.
(The unconditional version is obtained simply by changing any
and all conditional subsidiary operations to be unconditional;
all loading of the hardware context flag may then be eliminated.)


Figs. 32-36 illustrate the operation of the GET-FROM-EAST
instruction for eight hardware processors operated as four
VP-banks, using the method of Fig. 30 to put a shifted copy of
field b into field a. Assume that the physical processors P0
through P7 are organized into a physical grid as shown in Fig.
37.


Assume further that within each physical processor the
virtual processors in banks 0 through 3 are organized into a
small grid as shown in Fig. 38. The overall virtual NEWS grid
therefore is represented by Fig. 39. In the example shown in
Figs. 32-36 the context flag of every processor is 1, so the
operation is effectively unconditional.
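

The split between movement inside a physical processor and
movement over the physical grid may be sketched as follows, for
distinct fields a and b. This is an illustrative model, not a
transcription of Fig. 30. It assumes row-major numbering of both
the physical grid and the small grid, and takes East to be the
direction of increasing x; the function name is hypothetical.

```python
def virtual_get_from_east(a, b, context, px, py, nx, ny):
    """Each active virtual processor stores into a the b value of its
    virtual East neighbor. Fields are indexed [bank][physical processor]."""
    for k in range(nx * ny):
        xk, yk = k % nx, k // nx
        on_east_edge = xk == nx - 1
        # Source bank: next column of the same row, wrapping to column 0.
        src_bank = yk * nx + (0 if on_east_edge else xk + 1)
        for j in range(px * py):
            if not context[k][j]:
                continue
            if on_east_edge:
                # Data lives in the physical East neighbor.
                src_proc = (j // px) * px + (j % px + 1) % px
            else:
                # Data lives in another bank of the same processor.
                src_proc = j
            a[k][j] = b[src_bank][src_proc]
    return a
```

Only the banks on the East edge of the small grid need the
physical grid at all; every other bank is served by a local copy.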


Example: SEND 

The physical-processor version of the SEND instruction
relies on special purpose hardware (the router detailed in the
above-noted patent No. 4,598,400 and application no. 499,474,
both incorporated by reference herein) to transmit messages from
one processor to another. The actual hardware implementation is
quite complex, but for the purposes of this discussion may be
summarized abstractly as follows. Each processor is attached to
a router node. (In the actual hardware each router node
services several processors, but that fact is not important
here.) Suppose that physical processor g (the sender) is to
send message m to physical processor h (the destination).
Processor g furnishes to its router the physical processor
number of h and the message data m (effectively concatenated
into one long string of bits). The router node then forwards
the data m to the router node connected to physical processor
h. That router then stores m into the memory of physical
processor h. Of course, while all this is going on other
physical processors may also have requested delivery of other
messages to other destinations.


There is an additional piece of hardware (the "message
detector" of Patent No. 4,598,400, incorporated by reference)
that, among other functions, can cause the delivery of a message
to its destination to be conditional upon the message data. In
our example, it might be specified that only messages whose
first three bits are 101 are to be stored into the memories of
their respective destination processors. All other messages
remain buffered within the router nodes. It might next be
specified that messages whose first three bits are 110 are to be
stored, and so on.


For each such directive (to store only certain messages) a
different memory address may be supplied. To go over the same
example in more detail, it might be specified that only messages
whose first three bits are 101 are to be stored, and they are to
be stored at address 2570. It might next be specified that
messages whose first three bits are 110 are to be stored at
address 3082; next that messages whose first three bits are 111
are to be stored at address 3594; and so on.


As seen from a physical processor, then, message 
transmission may be abstractly divided into two phases: 
(1) inject a message into the router, perhaps conditionally on 
the context flag; (2) receive a message from the router, if one 
has been sent and if the message detector permits. 


The message detector makes it possible to support the 
efficient transmission of messages among virtual processors. If 
virtual processor u is to send message m to virtual processor w, 
it provides to the "virtual router" the virtual processor number 
of w and the message data m (effectively concatenated into one 
long string of bits). Because in the current implementation the 
VP-ratio is constrained to be a power of two, a virtual 
processor address is simply the concatenation of a physical 
processor address and a VP-bank number. The message information 
supplied to the virtual router can therefore be interpreted by 
the physical router hardware as consisting of a physical 
processor number followed by a somewhat longer message data 
string (which consists of a VP-bank number followed by the
virtual message data m), as illustrated in Fig. 40.
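

The reinterpretation of a virtual message as a physical one may
be sketched with integer bit fields as follows. The field widths
and the function names are hypothetical; the sketch assumes only
what is stated above, namely that the VP-ratio is a power of two
and that the VP-bank number leads the message data.

```python
def pack_virtual_message(dest_vp, data, data_bits, bank_bits):
    """Split a virtual message into what the physical router sees.

    Returns (physical destination, physical payload, payload length),
    where the payload is the VP-bank number followed by the data.
    """
    physical_dest = dest_vp >> bank_bits
    bank = dest_vp & ((1 << bank_bits) - 1)
    payload = (bank << data_bits) | data
    return physical_dest, payload, bank_bits + data_bits


def unpack_payload(payload, data_bits):
    """Recover (VP-bank number, virtual message data) at the destination."""
    return payload >> data_bits, payload & ((1 << data_bits) - 1)
```

With a VP-ratio of 4 (two bank bits), a 4-bit message 1010 sent
to virtual processor 23 goes to physical processor 5 with the
6-bit payload 111010, whose leading two bits name VP-bank 3.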


The method therefore requires the delivery of the message to
the physical processor that contains w and then the use of the
message detector to ensure that the message data is stored at a
location within the memory set aside for the VP-bank that
contains w. This method for sending data from field b to field
a in the virtual processor specified by field d is outlined in
Fig. 41.
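

One plausible reading of this two-phase scheme is sketched
below. It is not a transcription of Fig. 41, whose exact steps
are not reproduced in this text. The router is modeled as a
plain list of buffered messages, the message detector as a test
on the leading VP-bank number, and each destination is assumed to
receive at most one message. The function name and the returned
step count are hypothetical.

```python
def virtual_send(a, b, d, context, n):
    """Send field b of each active virtual processor to field a of the
    virtual processor numbered by field d. Fields are indexed
    [bank][physical processor]; returns the number of delivery passes."""
    num_procs = len(b[0])
    passes = 0
    for src_bank in range(n):
        # Phase 1: every active processor injects one message.
        router = []
        for p in range(num_procs):
            if context[src_bank][p]:
                dest_proc, dest_bank = divmod(d[src_bank][p], n)
                router.append((dest_proc, dest_bank, b[src_bank][p]))
        # Phase 2: one detector pass per destination VP-bank.
        for dest_bank in range(n):
            passes += 1
            remaining = []
            for dest_proc, bank, data in router:
                if bank == dest_bank:          # detector pattern matches
                    a[bank][dest_proc] = data  # store in that bank's region
                else:
                    remaining.append((dest_proc, bank, data))
            router = remaining
    return passes
```

The nested loops make the quadratic cost visible: n injection
rounds, each followed by n delivery passes, for n^2 passes in all.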


The method shown in Fig. 41 unfortunately takes a number of
steps proportional to the square of the VP-ratio. In practice,
a more complex method may be used. The latter takes advantage
of the buffering within the physical router, plus three
additional features of the router hardware not yet discussed:
(1) Injection and storing of messages may take place
concurrently. (2) It is not always possible, because of buffer
limitations, for every physical processor to inject a message
simultaneously. Instead, the protocol is that every physical
processor may attempt to inject a message, and the router
returns to the processor a bit that is 1 if the message was
injected and 0 otherwise. (3) The global-OR facility may also
be used to determine whether any router node still has messages
buffered within it.


The improved method is shown in Fig. 42. It assumes that
there are two additional one-bit fields called c and e in each
virtual (not physical) processor. In the worst case this method
can still take time proportional to the square of the VP-ratio,
but empirical results show that in practice its performance is
considerably better than the performance of the method in Fig.
41. Of course, many other strategies are possible with the
given hardware.


Summary


The five instructions ADD, GLOBAL-ADD, PLUS-SCAN, SEND, and
GET-FROM-EAST are handled by the instant virtual processor
mechanism in a variety of ways:


(1) The ADD instruction simply iterates over VP-banks,
    performing the operation in each bank. The virtual
    processors do not interact.


(2) The GLOBAL-ADD instruction obtains the effect of
    gathering up information from all processors by
    gathering up partial results from all virtual
    processors within each physical processor and then
    gathering up a higher level of partial results from all
    physical processors.


(3) The PLUS-SCAN instruction obtains the effect of a
    prefix-sum operation over all virtual processors in
    three steps: (1) summarizing the data for all virtual
    processors within each physical processor,
    (2) performing a prefix-sum over physical processors,
    and then (3) spreading the results within each physical
    processor back to the virtual processors.


(4) The GET-FROM-EAST instruction performs some
    inter-virtual-processor data movement within each
    physical processor and uses the physical grid
    communication mechanism to accomplish the remaining data
    movement.


(5) The SEND instruction takes advantage of special
    buffering and pattern-matching mechanisms in the router
    hardware.


Despite the variety of mechanisms used in the implementation,
the overall goal is the same in each case: to support the
appearance to the front end computer of a much larger number of
processors than are actually implemented discretely in hardware.


Having thus described the virtual processor mechanism and a 
number of instructions specifically adapted for use with that 
mechanism, it will be readily apparent that alterations, 
modifications and improvements thereto will readily occur to 
those familiar with the art. Such obvious modifications, 
alterations and improvements are intended to be suggested by 
this disclosure and are therefore within the spirit and scope of 
the invention. Accordingly, the foregoing description is 
intended to be exemplary only, and not limiting. The invention 
is limited only as defined by claims appended hereto, and by 
their equivalents. 
CLAIMS 


1. In a single-instruction multiple-data (SIMD) parallel 
processor comprising a controller and an array of substantially 
identical physical processors controlled in parallel by said 
controller, each processor comprising an input, an output, a 
processing element and a memory element associated with each 
processing element, the processing element operating on data 
provided by its input and associated memory element, in 
accordance with instructions provided by said controller, to 
produce data at its output, a method of simulating the presence 
of a larger number of processors in the array than the number of 
said physical processors, thereby to provide a corresponding 
number of so-called "virtual processors," and of utilizing said 
virtual processors, comprising the steps of: 


(a) subdividing the memory elements associated with each of 
a plurality of physical processing elements in 
identical fashion to form a plurality of sub-memories 
associated with each processing memory, each of v bits 
in length; 


(b) providing at least a first instruction from the 
controller to a set of the physical processors to cause 
the processing elements thereof to process data stored 
at a first location in a first sub-memory associated 
with each such processing element; 
(c) at a subsequent time within the time allowed for the 
execution of said first instruction, providing said 
first instruction from the controller to such set of 
physical processors to cause the processing elements 
thereof each to process data stored at the same first 
location in a second sub-memory associated with the 
processor; and 


(d) providing for each virtual processor a number of flag 
bits, f, first and second ones of said flag bits 
functioning as condition code bits, recording the 
results of operations, and a third flag bit controlling 
whether conditional operations will store their results. 


2. The method of claim 1 wherein the instruction is an ADD 
instruction intended to cause the virtual processors to add 
together a multiplicity of numbers, and wherein the set of 
virtual processors whose memory is stored in the same relative 
position within each memory element is termed a "VP-bank," such 
method 


(e) for each VP-bank, (1) loading virtual carry, overflow 
and context flags for VP-bank k into respective 
hardware flag bits, (2) conditionally adding the field 
at location kv + b into the field at kv + a, where a 
and b are operand addresses, setting hardware carry and 
overflow flag bits as appropriate, and storing hardware 
carry and overflow flags back into virtual flags for 
VP-bank k. 
VIRTUAL PROCESSOR TECHNIQUES IN A MULTIPROCESSOR ARRAY 
ABSTRACT 


A virtual processor mechanism and specific techniques and 
instructions for utilizing such virtual processor mechanism 
within an SIMD computer having numerous processors, and each 
physical processor having dedicated memory associated 
therewith. Each physical processor is used to simulate multiple 
"virtual" processors, with each physical processor simulating 
the same number of virtual processors. The memory of each 
physical processor is divided into n regions of equal size, each 
such region being allocated to one virtual processor, where n is 
the number of virtual processors simulated by each physical 
processor. Whenever an instruction is processed, each physical 
processor is time-sliced among the virtual memory regions, 
performing the operation first as one virtual processor, then 
another, until the operation has been performed for all virtual 
processors. Physical processors are switched among the virtual 
processors in a completely regular, predictable, deterministic 
fashion. The virtual processor mechanism switches among virtual 
processors within instructions, so that at the completion of 
each instruction, it has been executed on behalf of all virtual 
processors. A number of instructions are shown for execution 
using these virtual processor techniques. 
All hardware processors perform in parallel: 
    Conditionally add field at b into field at a, 
        setting carry and overflow flags 


Figure 1: Method for ADD instruction for a physical processor 


All hardware processors perform in parallel: 
    For k = 0, 1, 2, ..., n - 1 do 
        [Processing of the k'th VP-bank] 
        Load virtual carry, overflow, and context flags for VP-bank k into hardware flags 
        Conditionally add field at kv + b into field at kv + a, 
            setting hardware carry and overflow flags 
        Store hardware carry and overflow flags back 
            into virtual flags for VP-bank k 


Figure 2: Method for ADD instruction for virtual processors 
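
By way of illustration only, the loop of Figure 2 may be sketched in Python as follows. The 8-bit field width (WIDTH), the representation of each physical memory as a list of n*v integer fields, the dictionary of per-bank flags, and the two's-complement overflow rule are assumptions made for this sketch and are not taken from the disclosure. The inner Python loop over processors merely stands in for parallel execution.

```python
WIDTH = 8  # assumed field width in bits (not specified by the figure)


def _sign(x):
    """Top bit of a WIDTH-bit field."""
    return (x >> (WIDTH - 1)) & 1


def vp_add(memories, flags, n, v, a, b):
    """Add the field at offset b into the field at offset a in every VP-bank.

    memories[p] is the memory of physical processor p: n banks of v fields.
    flags[p][k] holds the virtual 'carry', 'overflow' and 'context' flags of
    the virtual processor in bank k of physical processor p.
    """
    mask = (1 << WIDTH) - 1
    for k in range(n):                                # one pass per VP-bank
        for mem, vflags in zip(memories, flags):      # stands in for "in parallel"
            hw = dict(vflags[k])                      # load virtual flags into hardware flags
            if hw["context"]:                         # the add is conditional on context
                x, y = mem[k * v + a], mem[k * v + b]
                total = x + y
                result = total & mask
                hw["carry"] = total >> WIDTH          # unsigned carry out
                same_sign_inputs = _sign(x) == _sign(y)
                hw["overflow"] = int(same_sign_inputs and _sign(result) != _sign(x))
                mem[k * v + a] = result
            vflags[k]["carry"] = hw["carry"]          # store hardware flags back
            vflags[k]["overflow"] = hw["overflow"]
```

With n = 2 banks of v = 2 fields, a = 0 and b = 1, a bank holding 250 and 10 ends with 4 in field a and its virtual carry flag set, while a bank whose context flag is clear is left untouched.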
Figure 5: After the ADD instruction has processed VP-bank 1 
Figure 6: After the ADD instruction has processed VP-bank 2 
Figure 7: After the ADD instruction has processed VP-bank 3 
All hardware processors perform in parallel: 
    For k = 0, 1, 2, ..., n - 1 do 
        [Processing of the k'th VP-bank] 
        Load virtual context flag for VP-bank k into hardware flag 
        If the global-OR of the hardware context flag is 1: 
            Load virtual carry and overflow flags for VP-bank k into hardware flags 
            Conditionally add field at kv + b into field at kv + a, 
                setting hardware carry and overflow flags 
            Store hardware flags back into virtual carry and overflow flags 


Figure 8: Optimized method for ADD instruction for virtual processors 
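
The global-OR test of Figure 8 may be illustrated by the short Python sketch below, which reports the VP-banks that would actually be processed. The nested-list representation context[p][k] of the context flags and the function name are assumptions of the sketch. Python's any() stands in for the hardware global-OR.

```python
def banks_needing_work(context, n):
    """Return the VP-banks in which some physical processor has its context flag set.

    context[p][k] is the context flag (0 or 1) of bank k on physical processor p.
    """
    active = []
    for k in range(n):
        global_or = any(ctx[k] for ctx in context)    # models the hardware global-OR
        if global_or:
            active.append(k)                          # only these banks run the add
    return active
```

A bank is skipped only when no physical processor has a selected virtual processor in that bank, so the saving depends on how the selected virtual processors are distributed.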
All hardware processors perform in parallel: 
    (A) Clear a field g in the per-physical-processor housekeeping area 
    For k = 0, 1, 2, ..., n - 1 do 
        Load virtual context flag for VP-bank k into hardware flag 
        (B) Conditionally add field at kv + b into field at g 
            (perhaps setting hardware carry and overflow flags, 
            but that is irrelevant to this operation) 
(C) Perform PHYSICAL-GLOBAL-ADD on physical address g 


Figure 9: Method for GLOBAL-ADD instruction for virtual processors 


All hardware processors perform in parallel: 
    (A) Clear a field g in the per-physical-processor housekeeping area 
    For k = 0, 1, 2, ..., n - 1 do 
        Load virtual context flag for VP-bank k into hardware flag 
        If the global-OR of the hardware context flag is 1: 
            (B) Conditionally add field at kv + b into field at g 
                (perhaps setting hardware carry and overflow flags, 
                but that is irrelevant to this operation) 
(C) Perform PHYSICAL-GLOBAL-ADD on physical address g 


Figure 10: Optimized method for GLOBAL-ADD instruction for virtual processors 
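
The two-level reduction of Figures 9 and 10 may be sketched in Python as follows. The list-based memory model, the context[p][k] flag layout, and the use of Python's sum() in place of the physical global-add hardware are assumptions of this sketch. Field overflow is ignored.

```python
def vp_global_add(memories, context, n, v, b):
    """Sum the field at offset b over all selected virtual processors.

    memories[p] holds n banks of v fields for physical processor p.
    context[p][k] is the context flag of bank k on physical processor p.
    """
    partial = []
    for mem, ctx in zip(memories, context):           # stands in for "in parallel"
        g = 0                                         # step (A): clear field g
        for k in range(n):
            if ctx[k]:                                # step (B): conditional add into g
                g += mem[k * v + b]
        partial.append(g)
    return sum(partial)                               # step (C): physical global add
```

Each physical processor first reduces its own banks, so the physical reduction is performed once regardless of the VP-ratio.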
Figure 13: A GLOBAL-ADD instruction after step (B) for VP-bank 0 
Figure 14: A GLOBAL-ADD instruction after step (B) for VP-bank 1 
Figure 15: A GLOBAL-ADD instruction after step (B) for VP-bank 2 
Figure 16: A GLOBAL-ADD instruction after step (B) for VP-bank 3 
PHYSICAL-GLOBAL-ADD-ALWAYS 
Figure 17: A GLOBAL-ADD instruction after step (C) 
All hardware processors perform in parallel: 
    Clear a field g in the per-physical-processor housekeeping area 
    For k = 0, 1, 2, ..., n - 1 do 
        Load virtual context flag for VP-bank k into hardware flag 
        Conditionally add field at kv + b into field at g 
            (perhaps setting hardware carry and overflow flags, 
            but that is irrelevant to this operation) 
Perform PHYSICAL-PLUS-SCAN-ALWAYS from physical address g 
    to another field r in the per-physical-processor housekeeping area 
All hardware processors perform in parallel: 
    For k = 0, 1, 2, ..., n - 1 do 
        Load virtual context flag for VP-bank k into hardware flag 
        Conditionally copy field at r to field at kv + a 
        Conditionally add field at kv + b into field at r 
            (perhaps setting hardware carry and overflow flags, 
            but that is irrelevant to this operation) 


Figure 18: Method for PLUS-SCAN instruction for virtual processors 
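
The three phases of Figure 18 may be sketched in Python as follows. The sketch assumes that the physical scan is exclusive, that is, each physical processor receives the sum of the g fields of the processors before it, which is what the copy-then-add order of the final loop suggests. It also assumes that a and b are distinct offsets. The list-based memory model and the use of itertools.accumulate in place of the scan hardware are likewise assumptions.

```python
from itertools import accumulate


def vp_plus_scan(memories, context, n, v, a, b):
    """Store in field a of each selected virtual processor the sum of the b
    fields of all selected virtual processors that precede it."""
    # Phase 1: each physical processor sums its own selected banks into g.
    g = [sum(mem[k * v + b] for k in range(n) if ctx[k])
         for mem, ctx in zip(memories, context)]
    # Phase 2: exclusive prefix sum of g over physical processors, into r.
    r = list(accumulate(g, initial=0))[:-1]
    # Phase 3: spread the running total back over the banks.
    for p, (mem, ctx) in enumerate(zip(memories, context)):
        for k in range(n):
            if ctx[k]:
                mem[k * v + a] = r[p]                 # conditionally copy r to kv + a
                r[p] += mem[k * v + b]                # conditionally add kv + b into r
```

Virtual processors whose context flag is clear neither contribute to the running total nor receive a result.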
Figure 28: A PLUS-SCAN instruction after step (D) for VP-bank 2 
Figure 29: A PLUS-SCAN instruction after step (D) for VP-bank 3 
All hardware processors perform in parallel: 
    For k_y = 0, 1, 2, ..., n_y - 1 do 
        For k_x = 0, 1, 2, ..., n_x - 2 do 
            Load virtual context flag for VP-bank P(k_x, k_y) mod n into hardware flag 
            (A) Conditionally copy field at P(k_x + 1, k_y) + b to field at P(k_x, k_y) + a 
        Load virtual context flag for VP-bank P(n_x - 1, k_y) mod n into hardware flag 
        (B) Conditionally perform PHYSICAL-GET-FROM-EAST 
            from field at P(0, k_y) + b to field at P(n_x - 1, k_y) + a 


Figure 30: Method for GET-FROM-EAST instruction for virtual processors 


All hardware processors perform in parallel: 
    For k_y = 0, 1, 2, ..., n_y - 1 do 
        Copy field at P(0, k_y) + b to field g in housekeeping area 
        For k_x = 0, 1, 2, ..., n_x - 2 do 
            Load virtual context flag for VP-bank P(k_x, k_y) mod n into hardware flag 
            Conditionally copy field at P(k_x + 1, k_y) + b to field at P(k_x, k_y) + a 
        Load virtual context flag for VP-bank P(n_x - 1, k_y) mod n into hardware flag 
        Conditionally perform PHYSICAL-GET-FROM-EAST 
            from field at g to field at P(n_x - 1, k_y) + a 


Figure 31: Method for GET-FROM-EAST instruction for virtual processors 
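
The method of Figure 31 may be sketched in Python as follows. Several details are assumptions of the sketch and not statements of the disclosure: the bank index of virtual grid position (k_x, k_y) is taken to be k_x + n_x*k_y, its address to be v times that index, and the physical processors are modelled as a single west-to-east row that wraps around. Saving the westernmost b field in g before the inner loop appears to keep the method correct even when a and b name the same field.

```python
def vp_get_from_east(memories, context, nx, ny, v, a, b):
    """Each selected virtual processor copies field b of its east neighbor into its field a.

    memories[p] is physical processor p, listed west to east (assumed ring).
    context[p][bank] is the context flag of the receiving virtual processor.
    """
    def bank(kx, ky):                                 # assumed layout of the virtual grid
        return kx + nx * ky

    def addr(kx, ky):                                 # assumed address function P
        return v * bank(kx, ky)

    num = len(memories)
    for ky in range(ny):
        # Save the westernmost b field of every processor in its field g.
        g = [mem[addr(0, ky) + b] for mem in memories]
        for kx in range(nx - 1):                      # moves within a physical processor
            for mem, ctx in zip(memories, context):
                if ctx[bank(kx, ky)]:
                    mem[addr(kx, ky) + a] = mem[addr(kx + 1, ky) + b]
        for p, (mem, ctx) in enumerate(zip(memories, context)):
            if ctx[bank(nx - 1, ky)]:                 # physical get-from-east for the last column
                mem[addr(nx - 1, ky) + a] = g[(p + 1) % num]
```

Only the easternmost column of each virtual grid row uses the physical grid. All other movement stays within a physical processor's own memory.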
Figure 33: A GET-FROM-EAST instruction after step (A) for k_y = 0, k_x = 0 
Figure 34: A GET-FROM-EAST instruction after step (B) for k_y = 0 
Figure 35: A GET-FROM-EAST instruction after step (A) for k_y = 1, k_x = 0 
Figure 36: A GET-FROM-EAST instruction after step (B) for k_y = 1 
All hardware processors perform in parallel: 
    For k_1 = 0, 1, 2, ..., n - 1 do 
        Load virtual context flag for VP-bank k_1 into hardware flag 
        Conditionally inject fields k_1 v + d and k_1 v + b into the router 
        For k_2 = 0, 1, 2, ..., n - 1 do 
            Store messages whose first log_2 n bits equal k_2 into field k_2 v + a 


Figure 41: Method for SEND instructions for virtual processors 
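
The nested loops of the simple SEND method may be sketched in Python as follows, assuming the receive loop is nested inside the injection loop. The router is idealized as a list that accepts every message at once. A destination is assumed to be encoded as bank*P + processor, where P is the number of physical processors, so that the leading part of the address selects the VP-bank. No two messages are assumed to share a destination. None of these modelling choices is taken from the disclosure. The returned step count is included only to make the n*n receive passes visible.

```python
def vp_send_simple(memories, context, n, v, a, b, d):
    """Deliver field b of each selected virtual processor to field a of the
    virtual processor named by its field d; return the number of receive passes."""
    num = len(memories)
    passes = 0
    for k1 in range(n):
        # Conditionally inject (destination, data) pairs from bank k1.
        router = [(mem[k1 * v + d], mem[k1 * v + b])
                  for mem, ctx in zip(memories, context) if ctx[k1]]
        for k2 in range(n):
            passes += 1
            for dest, data in router:
                dest_bank, dest_proc = divmod(dest, num)
                if dest_bank == k2:                   # pattern match on the bank part
                    memories[dest_proc][k2 * v + a] = data
    return passes
```

For n banks the receive loop runs n*n times, which is the quadratic cost in the VP-ratio that the improved method seeks to avoid in practice.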


All hardware processors perform in parallel: 
    For k = 0, 1, 2, ..., n - 1 do 
        Copy context flag for VP-bank k to field kv + c 
While (any virtual processor has its context flag set) 
        or (any router node has buffered messages) do 
    All hardware processors perform in parallel: 
        For k = 0, 1, 2, ..., n - 1 do 
            Load virtual context flag for VP-bank k into hardware flag 
            Do these two things simultaneously: 
                Conditionally attempt to inject fields kv + d and kv + b 
                    into the router, storing returned "success" bit into field kv + e 
                Store messages whose first log_2 n bits equal k into field kv + a 
            Conditionally load logical NOT of field kv + e into hardware context flag 
            Unconditionally store hardware context flag into flag for VP-bank k 
All hardware processors perform in parallel: 
    For k = 0, 1, 2, ..., n - 1 do 
        Copy field kv + c to context flag for VP-bank k 


Figure 42: Method for SEND instructions for virtual processors 
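
The retry structure of Figure 42 may be sketched in Python as follows. The router model is hypothetical: it accepts at most `capacity` messages (assumed to be at least 1) per injection attempt and holds accepted messages in a list until the matching bank is served. Destinations are assumed to be encoded as bank*P + processor with a valid bank, and to be distinct. The list `saved` plays the role of field c, and the local `success` plays the role of field e.

```python
def vp_send_retry(memories, context, n, v, a, b, d, capacity=1):
    """SEND with retry: a virtual processor whose injection is refused keeps
    its context flag set and tries again on the next sweep over the banks."""
    num = len(memories)
    saved = [list(ctx) for ctx in context]            # field c: saved context flags
    buffered = []                                     # messages held inside the router
    while any(any(ctx) for ctx in context) or buffered:
        for k in range(n):
            accepted = 0
            for mem, ctx in zip(memories, context):
                if ctx[k]:                            # conditional injection attempt
                    success = accepted < capacity     # field e: returned success bit
                    if success:
                        buffered.append((mem[k * v + d], mem[k * v + b]))
                        accepted += 1
                    ctx[k] = int(not success)         # stay selected only if refused
            remaining = []
            for dest, data in buffered:               # store messages addressed to bank k
                dest_bank, dest_proc = divmod(dest, num)
                if dest_bank == k:
                    memories[dest_proc][k * v + a] = data
                else:
                    remaining.append((dest, data))
            buffered = remaining
    for ctx, old in zip(context, saved):              # restore the original context flags
        ctx[:] = old
```

When the router accepts most messages on the first attempt, the while loop runs only a few times, which is consistent with the observation that performance in practice is better than the worst case.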